{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chinese Name Gender Predictor\n",
    "James Chan 2018"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from xpinyin import Pinyin\n",
    "import pandas as pd\n",
    "from keras.layers import LSTM, Dense\n",
    "from keras.models import Sequential\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "The Chinese writing system is made up of individual characters famous for their elegant strokes and detailed complexity.  While most recognize a Chinese character when they see one, few knows about the existence of a romanized spelling system known as Pinyin, which translates Chinese characters to alphabets and vice versa.  \n",
    "\n",
    "\"The pinyin system was developed in the 1950s by many linguists, including Zhou Youguang, based on earlier form romanizations of Chinese. It was published by the Chinese government in 1958 and revised several times\".[1]\n",
    "\n",
    "![](files/img/bruce.png)\n",
    "<div align=\"center\">Figure 1. Bruce Lee's Chinese Name in Pinyin</div>\n",
    "\n",
    "The distinction between a female  and a male name in Chinese is more subtle than a typical American English name.  For example John and Mary in most cases belong to names of a male and female respectively, whereas the gender for a name like Xiao-Ming is practically a coin-toss. However, in most cases there are subtle patterns that leans towards one gender than the other.\n",
    "\n",
    "In this project, I have harnessed the power of deep learning to train a gender predictor for Chinese names using ~9800 Chinese name samples[2].  We hope that the gender predictor will perform at least as good as human.  \n",
    "\n",
    "#### Table of Content\n",
    "1. Framing the Problem\n",
    "2. Data Preprocessing\n",
    "3. Train the Model\n",
    "4. Evaluate the Model\n",
    "5. Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Framing the Problem\n",
    "We first strip the last name from our 9800 Chinese names because they have negligible correlation (if any) with the gender of a person.  Once we removed the last name, our Pinyin representation of chinese names are basically in alphabets.  By converting these alphabets into one-hot-vectors and assigning each with a timestamp, our training examples are ready to be fed into an LSTM model.  In the example below, t denotes each timestep and T denotes the total number of timesteps.\n",
    "\n",
    "![](files/img/feed.png)\n",
    "<div align=\"center\">Figure 2. Data Processing for Deep Learning (LSTM)</div>\n",
    "\n",
    "It is worth mentioning that this scheme is a many-to-one mapping, which means the input feature has a varying length timestamp with a single binary target output - gender."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Processing\n",
    "Here we process the the raw data per Figure 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#the conversion from Chinese character to Pinyin equivalent is taken care by the utility below [3]\n",
    "p = Pinyin()\n",
    "df = pd.read_csv('ChineseNames.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#change headers to english\n",
    "df.columns = ['Name', 'Gender']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#strip last name\n",
    "def strip_last_name(name):\n",
    "    return name[1:]\n",
    "\n",
    "def convert_to_pinyin(name):\n",
    "    return p.get_pinyin(name)\n",
    "\n",
    "def process_name(name):\n",
    "    name = strip_last_name(name)\n",
    "    name = convert_to_pinyin(name)\n",
    "    return name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     Name Gender\n",
      "9792  左婉怡      F\n",
      "9793  左烜晅      F\n",
      "9794  左雨晴      F\n",
      "9795   左越      F\n",
      "9796  左子烨      F\n",
      "           Name Gender\n",
      "9792     wan-yi      F\n",
      "9793  xuan-xuan      F\n",
      "9794    yu-qing      F\n",
      "9795        yue      F\n",
      "9796      zi-ye      F\n"
     ]
    }
   ],
   "source": [
    "#visualize before after processing\n",
    "print(df.tail(5))\n",
    "df['Name'] = df['Name'].apply(process_name)\n",
    "print(df.tail(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first five entries are raw data.  The last five entries were the Pinyin equivalent with last name stripped.  This looks correct.  Next we build a dictionary by scanning all the characters appeared.  If everything goes according to plan, there should only be 27 characters - 26 letters and a dash."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#get all unique chars\n",
    "characters = {}\n",
    "for i, name in enumerate(df['Name']):\n",
    "    for char in name:\n",
    "        if char in characters:\n",
    "            characters[char] += 1\n",
    "        else:\n",
    "            characters[char] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "27"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#show number of unique characters\n",
    "len(characters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "27 is the correct number as discussed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#find the longest name\n",
    "maxlen = df['Name'].str.len().max()\n",
    "maxlen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the longest name to limit the length of our time-series, which in this case is 16.  We then convert our data into time-series one-hot-vectors per our dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create mapping.  though it would be nice if they mapping is alphabetically ordered, but it doesn't matter mathematically.\n",
    "idx = 0\n",
    "for c in characters.keys():\n",
    "    characters[c] = idx\n",
    "    idx += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create training example of dimension (example #, timestep, # of features (length of 1-hot vector))\n",
    "num_examples = df.shape[0]\n",
    "time_steps = maxlen\n",
    "num_features = len(characters)\n",
    "def char_to_vec(char):\n",
    "    vector = np.zeros((1, num_features), dtype=int)\n",
    "    idx = characters[char]\n",
    "    vector[0,idx] = 1\n",
    "    return vector\n",
    "\n",
    "def name_to_vec(name):\n",
    "    example = np.zeros((maxlen, num_features), dtype = int)\n",
    "    for i in range(maxlen):\n",
    "        if i < len(name):\n",
    "            char = name[i]\n",
    "            vector = char_to_vec(char)\n",
    "            example[i,:] = vector[0,:]\n",
    "        else:\n",
    "            example[i,:] = np.zeros((num_features),dtype=int)\n",
    "    return example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#convert all examples to vector format\n",
    "X = np.empty((df.shape[0], maxlen, num_features))\n",
    "for i, name in enumerate(df['Name']):\n",
    "    example = name_to_vec(name)\n",
    "    X[i,:,:] = example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Y = df['Gender'].values\n",
    "Y[Y == 'M'] = 1\n",
    "Y[Y == 'F'] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(9797, 16, 27)\n",
      "(9797,)\n"
     ]
    }
   ],
   "source": [
    "print(X.shape)\n",
    "print(Y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These dimensions look correct.  We have about 9800 name entries.  The timestamp is 16 in length, and our one-hot-vector has 27 unique positions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Train the Model\n",
    "The parameters below were determined via manual tuning.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = Sequential()\n",
    "model.add(LSTM(64, input_shape=(maxlen, num_features), return_sequences=False, dropout=0.2, recurrent_dropout=0.2))\n",
    "for _ in range(8):\n",
    "    model.add(Dense(64, activation='tanh'))\n",
    "model.add(Dense(1, activation='tanh'))\n",
    "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/10\n",
      "9797/9797 [==============================] - 10s 1ms/step - loss: 0.7146 - acc: 0.5363\n",
      "Epoch 2/10\n",
      "9797/9797 [==============================] - 4s 358us/step - loss: 0.6771 - acc: 0.5762\n",
      "Epoch 3/10\n",
      "9797/9797 [==============================] - 4s 363us/step - loss: 0.6629 - acc: 0.6016 \n",
      "Epoch 4/10\n",
      "9797/9797 [==============================] - 4s 359us/step - loss: 0.6566 - acc: 0.6175\n",
      "Epoch 5/10\n",
      "9797/9797 [==============================] - 4s 428us/step - loss: 0.6570 - acc: 0.6152\n",
      "Epoch 6/10\n",
      "9797/9797 [==============================] - 4s 386us/step - loss: 0.6544 - acc: 0.6193\n",
      "Epoch 7/10\n",
      "9797/9797 [==============================] - 3s 357us/step - loss: 0.6548 - acc: 0.6143 1s - \n",
      "Epoch 8/10\n",
      "9797/9797 [==============================] - 3s 348us/step - loss: 0.6528 - acc: 0.6205 0s - loss: 0.65\n",
      "Epoch 9/10\n",
      "9797/9797 [==============================] - 3s 356us/step - loss: 0.6517 - acc: 0.6234\n",
      "Epoch 10/10\n",
      "9797/9797 [==============================] - 4s 384us/step - loss: 0.6471 - acc: 0.6197\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x206b3323eb8>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(X,Y, batch_size=64, epochs=10, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Evaluate the Model\n",
    "With just 10 epochs of training, the model reached an accruacy of low 60%, which doesn't seem that great, but by increasing the number of epochs, there also isn't a noticeable improvement in accuracy either.  Low 60% was about as high as I could get with my model, which is within reasonable expectation.  To further evaluate the accuracy of the model, we will take some real life examples. This step is sectioned into 3 parts:\n",
    "1. Gender Prediction of Famous People\n",
    "2. Gender Prediction of My Grade School Classmates\n",
    "3. Gender Prediction of My Immediate Family Members"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 1 Gender Prediction of Famous People [82% Accuracy]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#let's test on some real life examples:\n",
    "X_test = np.empty((0, maxlen, num_features), dtype=int)\n",
    "Y_test = np.empty((0), dtype=int)\n",
    "people = ['long', #jackie chan\n",
    "          'xiao-long', #bruce lee\n",
    "          'bing-bing', #fan bingbing\n",
    "          'zi-yi', #zhang ziyi\n",
    "          'zi-dan', #donnie yen\n",
    "          'lian-jie', #jet li\n",
    "          'yu-ling', #lucy liu\n",
    "          'en-mei', #amy tan\n",
    "          'ming-na', #ming-na wen\n",
    "          'ze-dong', #chairman mao\n",
    "         'jin-ping'] #chairman xi\n",
    "Y_test = np.array([1,1,0,0,1,1,0,0,0,1,1])\n",
    "for name in people:\n",
    "    array = name_to_vec(name)\n",
    "    X_test = np.concatenate((X_test, np.expand_dims(array, 0)), 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mis-classified:  bing-bing\n",
      "mis-classified:  lian-jie\n",
      "\n",
      "accuracy:  0.8181818181818182\n"
     ]
    }
   ],
   "source": [
    "#predict\n",
    "predictions = model.predict_classes(X_test)\n",
    "for i, prediction in enumerate(predictions):\n",
    "    if prediction != Y_test[i]:\n",
    "        print('mis-classified: ', people[i])\n",
    "print()\n",
    "print('accuracy: ', sum(predictions[:,0] == Y_test)/Y_test.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is more impressive than I expected even after just 10 epochs of training. 82% accuracy is VERY HIGH for classifying chinese names.  I would have expected Bing Bing to be classified as female since any name ending in -ing is more commonly female than male in my experience.  I am however, not at all surprised that Jet Li were mis-classified since Lian tends to be more on the feminine side in my experience"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 2 Gender Prediction of My Grade School Classmates [82% Accuracy]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#some of the names that comes to mind from grade school\n",
    "X_test = np.empty((0, maxlen, num_features), dtype=int)\n",
    "Y_test = np.empty((0), dtype=int)\n",
    "people = ['ting-pei',\n",
    "          'xin',\n",
    "          'jin-hao',\n",
    "          'zhe-an',\n",
    "          'yi-cheng',\n",
    "          'zi-jun',\n",
    "          'zhi-hao',\n",
    "          'wei-han',\n",
    "          'guan-yu',\n",
    "          'xiu-qi',\n",
    "          'jun-de']\n",
    "Y_test = np.array([0,0,1,1,1,1,1,1,0,1,1])\n",
    "for name in people:\n",
    "    array = name_to_vec(name)\n",
    "    X_test = np.concatenate((X_test, np.expand_dims(array, 0)), 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mis-classified:  ting-pei\n",
      "mis-classified:  xiu-qi\n",
      "\n",
      "accuracy:  0.8181818181818182\n"
     ]
    }
   ],
   "source": [
    "#predict\n",
    "predictions = model.predict_classes(X_test)\n",
    "for i, prediction in enumerate(predictions):\n",
    "    if prediction != Y_test[i]:\n",
    "        print('mis-classified: ', people[i])\n",
    "print()\n",
    "print('accuracy: ', sum(predictions[:,0] == Y_test)/Y_test.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again we have 81% accuracy, which is awesome!  I was surprised that ting-pei was classified as male, because ting by itself will classify as female and pei is a more commonly a feminine name base on my experience. I knew xiu-qi were going to be mis-classified right off the bat and it was. I am a bit upset that the model DID NOT mis-classify zi-jun, because most humans would likely to have mis-classified this one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 3 Gender Prediction of My Immediate Family Members [80% Accuracy]\n",
    "The model also achieved a whopping 80% accuracy for 15 of my close family members. Their names are withheld for privacy purposes. 3 of whom were mis-classified, and out of those 3, 2 of them are known to have names that tend to be mis-classified even by human."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Conclusion\n",
    "I am extremely pleased with the result.  80% accuracy means 4 out of 5 names will be classified correctly.  In my experience, this is about the right amount because while most people will go with names that are of statistical norm to their gender, a fair amount of people also tend to deviate from that.  \n",
    "\n",
    "If you are familiar with Chinese names, please help me by taking the randomly generated quiz below and let me know if you did better than my A.I. Gender Predictor! Please share your result with jchan70@gatech.edu."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Application\n",
    "Advertisement and recommendation system are such applications that can greatly benefit from a gender predictor.  When the gender of a user is unavailable, but the nationality is identified as Chinese, we may predict with a fairly high certain the gender of the user.  Products such as make-up and purse are highly favored statistically by females, whereas items such as boxing gloves are statistically unattractive to females.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Randomly Generated Quiz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4740          wei-jie\n",
       "9490              mei\n",
       "4550             qian\n",
       "372          tian-hui\n",
       "8803    liu-meng-ying\n",
       "5891           jia-yi\n",
       "361          hao-yang\n",
       "7663             yuan\n",
       "1099            shuai\n",
       "5304               yi\n",
       "5407         dan-yang\n",
       "604             cheng\n",
       "6595         zhuo-wan\n",
       "1589          shu-ren\n",
       "2451          yi-long\n",
       "4863          zi-jian\n",
       "9398          yu-ting\n",
       "1625         xiao-fei\n",
       "940         ying-feng\n",
       "2349        shao-qian\n",
       "Name: Name, dtype: object"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quiz = df.sample(20)\n",
    "quiz['Name']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quiz Answers (Spolier Alert)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4740    1\n",
       "9490    0\n",
       "4550    1\n",
       "372     1\n",
       "8803    0\n",
       "5891    0\n",
       "361     1\n",
       "7663    0\n",
       "1099    1\n",
       "5304    0\n",
       "5407    0\n",
       "604     1\n",
       "6595    0\n",
       "1589    1\n",
       "2451    1\n",
       "4863    1\n",
       "9398    0\n",
       "1625    1\n",
       "940     1\n",
       "2349    1\n",
       "Name: Gender, dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quiz['Gender']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reference:\n",
    "1. Wikipedia https://en.wikipedia.org/w/index.php?title=Pinyin&oldid=856531498\n",
    "2. Dataset: https://www.researchgate.net/publication/269630594_9800_Chinese_Names_with_Gender\n",
    "3. Chinese to Pinyin Translator: https://github.com/lxneng/xpinyin\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
